Overview

The Secure Data Commons (SDC) provides access to novel data sets within the Department of Transportation for researchers and partner agencies. The SDC provides a platform for analysis efforts which integrate these novel data sets. To date, the ‘Commons’ aspect of the Secure Data Commons has not been fully utilized; this research project demonstrates an approach to integrating multiple data sets in the SDC. This document summarizes the work of the SDC Cross-project team in assessing how two of the SDC data sets can be usefully analyzed together.

Two data sets are used in this work:

  • FMI Data: The Freight Mobility Initiative (FMI) data is provided via an agreement between the American Transportation Research Institute (ATRI) and the Bureau of Transportation Statistics (BTS) of the USDOT. This data set includes GPS locations, speed, and direction of travel for anonymized trucks.

  • Waze Data: The mobile phone navigation app company Waze is providing data on crowdsourced traffic incident reports and additional system-generated data via an agreement with the DOT. These data include location and time of incidents, along with a number of characteristics of incidents, as well as location, extent, and speed of jams.

Data set characteristics:

Characteristic FMI Waze Notes
Spatial Range National National Waze data organized with state partitions
Available Time Range Since October 2018 Since March 2017 Gaps exist for some time periods
Volume of data 106 B GPS records 1.6 B alert records
3.4 B jam records
11 B jam point sequence records
As off late April 2020
Update frequency Bi-weekly API request every 2 minute Both have pipeline processing times, so not available to analysts immediately

Main takeaways

  • The data can be usefully integrated together. The two data sets provide complementary information on volume and characteristics of crowdsourced traffic incidents, and volume and characteristics of commercial trucking activity. Together, these data could be applied to a number of use cases in roadway safety, roadway system performance, and travel monitoring.

  • Through this effort, we developed general-purpose query and data processing scripts in GitLab which can be used by other teams. The code collaboration tools in the SDC provide a means for future analysts to replicate and build on the analysis presented here. This effort leveraged some existing work by other teams (BTS) who had started exploring the FMI data, as well as code from the Safety Data Initiative Waze pilot project.

  • Successfully integrating these two high-volume, high-velocity data sets requires a familiarity with the characteristics of the data, the data warehousing system, and best practices for geo-spatial analysis. Documentation exists in the SDC to provide much of this necessary information, but future users of these data would benefit from documentation which goes into more detail

  • The spatial and temporal patterns of the FMI and Waze data are distinct.

    • There are locations and times where truck activity is high but Waze activity is low, likely near distribution centers and commercial shipping routes.
    • The converse is also true; Waze data volume is highest at commuting hours, in more densely populated areas.
    • However, there is useful areas of overlap between these two data sets.
  • This effort demonstrates a scale-able, repeatable process for query, integration, and analysis of multiple data sets within the SDC.

Working effectively in the SDC

This assessment of multiple data sets in the SDC also serves as an opportunity to demonstrate how to work effectively in this secure, cloud computing environment.

The SDC support team provides a detailed user guide (available publicly here) for data analysts. This guide, along with instructions provided by the SDC support team, is sufficient for users to understand how to log on and access these data sets. However, beyond data access alone, competency in modern data science skills are required to be an effective user of the SDC. There are general data analysis skills which users should be familiar with before embarking on a project in the SDC.

General Data Analysis Skills Needed:

  • Working with data in the SDC requires at a minimum some familiarity with SQL (structured query language). SQL is a general-purpose query language familiar to users of any relational database.

  • Efficient use of the data is greatly facilitated by familiarity with a modern data science scripting language such as R or Python. These open-source languages have a wide variety of sophisticated data science tools, including machine learning, geospatial analysis, time-series modeling, data integration, and data visualization. The SDC provides support for both R and Python, on the default Windows machines as well as on optional Linux machines.

  • Code collaboration and version control is supported in SDC with a GitLab server.

    • This allows analysts to follow modern data science best practices by developing code in collaboration with teammates, keeping working versions of code available while developing new features. Familiarity with the git workflow, including the use of the GitHub platform, will facilitate this collaboration.
    • The data analyst user guide provides instructions on how to enable GitLab code repositories, but analysts need to bring their own knowledge of how to effectively set up and manage the repositories for their needs.

In addition, beyond these data science skills, effective use of the SDC is facilitated by the following skills.

  • Understanding where data exist and how to share files.
    • Data Warehouse: Each data set exists in a data warehouse which can be queried by users. Users need to understand the basics of how to connect to a data resource (typically a relational database) for data query.
    • Team S3 buckets: Data and file sharing with a team is achieved with the cloud data storage service from AWS, S3 buckets.
    • Local instance: Analysts work on cloud computing instances, which will feel familiar to anyone who has logged on to a computer remotely.
  • Planning a project in a secure computing platform.
    • Adding new tools: The benefit of open source tools like Python and R is the large user community and active development of additional analysis packages. Users should know how to install additional tools to augment their analysis workflows.
    • Accessing existing code: The GitLab platform referenced above houses multiple repositories of code which could be useful for projects. Users will benefit from an understanding of how to search for and use the existing code repositories.
    • Uploading user-supplied data: Users can upload additional datasets, like geospatial layers or other contextual data, via the team S3 bucket. Users need to understand the concepts of the S3 bucket upload and download to take advantage of this feature.
    • Export request process: The security of the SDC is achieved by a controls on the data export process. Users need to understand the requirements of the export request process, and tune their workflows to only request exports of sufficiently derived data that meets the data sharing agreements of the relevant data sets.
  • Optimizing the data analysis workflow
    • Optimize code: Any process can be improved on, and understanding how to improve a process like querying a database in SDC is an important skill. Making a process work once may be sufficient, but for repeated processes, users will want to invest some time in code review and optimization.
    • Cloud instance management: A substantial advantage of the SDC is the ability to change workstation instance configuration, namely to increase CPUs and RAM as needed for large tasks, then reducing when not needed. Users will benefit from an understanding of what processes require or can be optimized by such cloud instance configuration. In addition, users also can benefit from understanding how to make a process run in parallel, to take full advantage of multiple CPUs.
    • Working with SDC team to make changes to data warehouse: Through the process of working with data in the SDC, users may encounter opportunities for optimizing the data warehouse itself. The SDC team is capable of enacting data warehouse optimization when necessary.

Some further details on how these skills are employed in the SDC follow in the section below.

Establishing SDC Workflow

Data Access Permissions

Both data sets are accessed in the SDC, stored in separate data warehouses. In order to access the data, an analyst needs not only SDC credentials, but also permissions specific to the data warehouses. Data access request forms exit for each data set, and these requests are reviewed by the respective data stewards within USDOT. The data access request forms detail the responsibilities of the data analyst for secure handling of the data, as well as restrictions on how the data can be applied.

The Cross project team requested and was granted permission to access these data sets within SDC.

The documentation of the datasets themselves is relatively scant currently. Meta-data on the schema, units, and other context about the variables in each data set are not currently sufficient for a user with no prior knowledge of the data to work with the data effectively.

Workstation Setup

Within SDC, there are multiple workstation configurations possible. The default workstation is a Windows machine (as an AWS EC2 instance), with a number of open-source data science tools installed: SQL Workbench, Anaconda, RStudio, and Git for Windows are being the tools used for this project.

In addition, users have the option to request other workstations. For this project, the data querying and analysis was all done using a Linux-formatted workstation. Some open-source geospatial tools are easier to install on Linux machines, and using development interfaces like Jupyter Notebooks for Python and RStudio Server for R allows a seamless user interface. Both Linux and Windows workstations can be re-sized as needed in the SDC. For large spatial analysis tasks, the processing can be parallelized and the instance can be configured with a large number of CPUs, facilitating rapid analysis workflows.

Each project in SDC also has an AWS S3 bucket set up. The S3 bucket acts as a cloud storage drive for any type of file. For this project for example, non-SDC data inputs (geospatial layers from the US Census Bureau). File exchange between the S3 bucket and individual analyst workstations is accomplished with command line tools.

Repository Setup

The code for this project is written in Python and R, and was contributed by multiple team members. To collaborate and version-control the code, we established a repository on the GitLab platform in SDC. We also drew on some general-purpose code written in Python for SDC users to query the FMI data, as well as code written in R for another project in the SDC involving spatial analysis and mapping using open-source tools. SQL code was written for the data store queries, in some cases adapting template queries provided by the SDC enablement team.

The code for this project is version-controlled in the FMI_Waze repository on the GitLab service within SDC.

Methods

Given the vast scale of these data sets, this assessment selected one region of the country, for one month, namely eastern Massachusetts for September 2019. This allows a proof-of-concept assessment of how these data can be joined and analyzed in concert. The process involved the following steps:

Querying the data warehouse

Both the Waze and FMI data can be queried using standard SQL syntax. The Waze data are warehoused in an Amazon Redshift cluster, while the FMI data are warehoused in a Hive/Hadoop cluster. In practice, the query syntax is similar for both: query the database for a date range and general spatial area, then perform more detailed spatial analysis on the results from the query. In the course of developing the code for this project, we encountered performance issues with the Hive/Hadoop cluster. We worked with the SDC developer team to test improvements to the database structure, which the SDC team has implemented.

The ‘two-step’ query process (first query a general area in SQL, then perform more detailed spatial data processing on the results) could in theory be simplified to one step. However, this would involve putting much more computational load on the data warehouse infrastructure, as opposed to on a data analyst’s own EC2 compute instance in the two-step process. The first spatial step is similar for both: querying for a date range (2019-09-01 to 2019-09-30) and a latitude and longitude range that encompasses eastern Massachusetts (or any region of interest). The second step is also the same: convert to spatial data frames, apply the same geographic projection on the latitude and longitudes, and clip to a shapefile for the state of Massachusetts using standard, open-source geospatial analysis libraries.

Spatial joining

The method chosen to join two spatial datasets depends on the required analysis. In this section, we will summarize two potential methods for spatial joining in the project and the factors that were considered in choosing a method. The two most meaningful options for this project were to join the datasets by road segments, or to join by a common grid structure. These two approaches are summarized below:

  • Road segment
    • Joining points to road segments results in a data structure where each road segment is an observation, with attributes for number and characteristics of commercial motor vehicles, and number and characteristics of traffic alerts. Each time step would add one additional row for each road segment.
    • The Highway Performance Monitoring System (HPMS) provides a road network for the National Highway System roads. While convenient to use for spatial analyses, it is limited because it does not include most lower functional class roads.
    • Each state maintains their own detailed road network. It would be possible to access a road network for this study area (Massachusetts), but the methods of spatial joining would need to be adapted to apply to other states.
    • Any analyses need to take into account the different lengths and characteristics of the road segments; correctly joining points to road segments requires careful work to assign points correctly to the right lanes and directions (applicable to typically only to higher functional class roads).
  • Grid or Polygon
    • Joining points to road segments results in a data structure where each grid cell or polygon is an observation, with attributes for number and characteristics of commercial motor vehicles, and number and characteristics of traffic alerts. Each time step would add an additional row for each grid cell or polygon.
    • Joining to a county, census block, or other political boundary is straightforward and would facilitate use of other characteristics of that political unit, including population or socio-economic data like the Longitudinal Employer-Household Dynamics (LEHD) data from the Census Bureau.
    • Joining to a grid cell is likewise straightforward. By providing an equal-area grid cell to join to, spatial analysis is greatly facilitated, as area does not need to be used as a co-variate.

For most safety applications, it is ideal to work at the road segment level, since most safety interventions are applied at this level. For our initial work, we elected to use a grid to facilitate rapid analysis to compare the two data sets in conjunction with each other. We joined all point data (FMI and Waze) to a grid layer of 1 square-mile hexagons. Hexagonal tessellations are appropriate because they better represent underlying linear features like roads, while minimizing the data artifacts associated with edge areas (ESRI reference). The area of eastern Massachusetts selected here encompasses 3,468 square miles. For the month of September 2019, data on 5,710,425 unique trucks by hour were included from the FMI data, as well 460,165 unique Waze events by hour.

Results

Mapping

The following maps show aggregated counts of unique truck IDs per hour and counts of Waze events per hour. The spatial unit is 1-square mile hexagonal grid cells. A road segment approach using the HPMS network or a more detailed state-provided network is possible in the future.

These maps further aggregate the count of each dataset to daytime (7 am to 7pm, Eastern) and weekend/weekday time periods for ease of comparison.

Weekdays

Comparing daytime to nighttime, the FMI data show consistent patterns but less total volume in the evening. The mean count of unique truck IDs per hour reaches as high as 106. High values are consistently observed along the interstate 495 and interstate 90 routes (Google Map of the area). While interstate data is most prominent in the FMI dataset, there is data on truck activity across the entire study area.

When clicking the Waze Alert Counts tab, a strikingly different pattern emerges. Note the scale differs between the FMI and Waze tabs. The dense cluster of Waze alert activity in the metro Boston area is maintained in both daytime and nighttime, but much less activity is observed in the nighttime. Note the scale differs between the FMI and Waze tabs.

FMI Truck Counts

Waze Alert Counts

Weekend

The patterns on the weekends are similar for both FMI and Waze. Note that the scale is the same for the weekday plots within each data set (FMI and Waze), but remains different across data sets. The FMI data show distinct patches of high activity on weekends, which may represent distribution centers or other warehouses. Waze activity is reduced on weekends, and distinct patches of high activity likewise emerge.

FMI Truck Counts

Waze Alert Counts

FMI Truck Speeds

Note these maps compare only weekday and weekend time periods for mean speeds of unique truck IDs within a grid cell. The mean speed of a given truck ID was first calculated. The values in these maps are the overall averages of the mean speeds of all the trucks. Truck speeds are generally lower in the Boston metro area, with the interstate areas contributing some of the highest speed. Speeds generally retain the same characteristics on weekdays and weekends.

Weekday

Weekend

Volume relationships

We additionally assessed the relationship between the two data sets in terms of the volume of data. A high correspondence in data volume indicates that when analyzed by equal-area grid cell and hour, generally high activity in one stream of data is matched by high activity in the other. Each point in the figures below represents one grid cell. Values are the sum of distinct truck IDs or unique Waze alerts for the specified time frame (weekend or weekday, daytime or nighttime) in that grid cell.

Since both axes represent highly skewed data (few, very large values), it is also useful to view these relationships on a logarithmic scale. This demonstrates close correspondence between the volume of FMI and volume of Waze alert data when accounting for the distribution of the values.

Waze jams and truck speeds

The following plot and table demonstrate that the two data sets can be combined in useful ways. When at least 3 jam reports are present in the Waze data, there is a dramatic drop in the speeds of trucks in the FMI data. The boxplots below group truck speeds into five categories, and the y-axis represents the mean count of distinct truck IDs per grid cell for that time period and number of jams. Note that ‘low jams’ represents most of the data, 92,675 grid cells x weekend/weekday time period, while ‘high jams’ is a less common condition, only 8,465 grid cells x weekend/weekday time period.

The same data are summarized below in tabular form for quick reference.

Summary of FMI and Waze data by weekend/weekday and high Waze jam counts
weekend high_jams Number of grid cell x time combinations Median FMI truck speeds Sum FMI truck counts Median Waze jam reports Sum all Waze reports
Weekday High Jams 6,593 15 435,550 10 232,675
Weekday Low Jams 55,529 28 4,417,368 0 171,390
Weekend High Jams 1,872 31 26,228 6 25,568
Weekend Low Jams 37,146 34 831,279 0 42,885

Next Steps

Next steps will be considered in consultation with the project manager and stakeholders. We consider the following to represent potentially fruitful next steps:

  • Pandemic response
    • The FMI data represents an unprecedented opportunity for USDOT to monitor and understand commercial motor vehicle activity. The COVID-19 pandemic is a natural experiment on all aspects of society. In particular, it is affecting the trucking industry in distinct ways. Some states saw spikes at the outset of the pandemic, followed by dips of varying depths in trucking activity (ATRI analysis).

    • Combining the Waze and FMI data would provide a way for the Department to understand and monitor the distinctive behaviors of commuters versus commercial motor vehicles as our country recovers from the pandemic.

  • Travel time / Mobility / Energy
    • System-generated jam files from Waze include the spatial extent as well as severity of the jam. Looking at the influence of jams on freight travel from FMI could be fruitful.

    • Fuel efficiency and truck platooning are possible to investigate using the FMI data, since unique truck IDs are available with GPS pings at high frequency.

  • Safety
    • Crash predictability (with police crash report data). The Waze data has been successfully used to understand and predict crash frequency at broad spatial scales. Combining both Waze and FMI could support a crash prediction tool for commercial motor vehicles more effectively than using either one of the data sets alone.
  • Assessment of data and validation
    • WYDOT Speed data to NPM-RDS. The Connected Vehicle Pilot in SDC has detailed vehicle speed data from Wyoming Department of Transportation. These data could be usefully validated against speed profile data for NHS road segments in the National Performance Management Research Data Set (NPM-RDS) from FHWA.

    • TMAS to FMI. The Traffic Monitoring Analysis System (TMAS) from FHWA provides ground-truthed traffic count and vehicle classification data at over 8,000 locations nationwide. Joining TMAS and FMI data would allow an understanding of the percent of all trucks which are represented in the FMI data, and how this percent representation varies by space and time.

    • TMAS to Waze (under SDI). The Safety Data Initiative is now undertaking an analysis of the Waze data in the context of the TMAS data as noted above for FMI. This will provide confidence intervals for how much of the traveling public is represented in the Waze data.

Areas for Improvement for SDC

This project allowed an opportunity to review areas of improvement for the SDC to better enable similar analysis of multiple data sets. Suggestions for the SDC team:

Data Documentation

  • Documentation of where data exist at different points in the pipeline is not obvious. The data pipeline in SDC involved ingestion of the raw data from the provider, curation into a format which can be used by analysts, and persistence into the relevant data warehouse (Hadoop for FMI, Redshift for Waze).
  • Numerous QA/QC steps are performed along the data processing pipeline, but these aren’t documented in a way which is visible to analysts. Documentation of these steps will build confidence in the data.

Dataset Completeness Visualization

Through this cross-project work, it became clear that there is a need for basic metrics to capture data completeness. Suggested approaches:

  • Simple dashboard of data quality over time for each data set, reflecting gaps in data by day, if any, as well as any warnings from the automated data quality tools (Canary).
  • Record of updates to data sets made visible to users. For example, Waze alert data for April - December 2019 had gaps before 2020-05-15, but was fixed after 2020-05-15.

Demonstrations of analysis steps

While the Data Analyst user guide does provide basic steps for query of the databases, not all users have access to all data sets. Therefore, it could be very beneficial to have an an example data set drawn from a public source, available to all users within the SDC, for demonstration purposes. For example, publicly available Traffic Volume Trends data from FHWA could be brought in to SDC and put into the Hadoop cluster, and US Census Bureau shapefiles at the county level (example) could be stored on a demonstration S3 bucket. This could then serve as an example data set for a Code Academy-style demonstration of how to do the following:

  • Access Hadoop and perform a query
  • Access S3 and download a file to the analyst workstation
  • Perform simple spatial overlays and data analysis, and then generate results locally
  • Save results to S3 and export